Abstract

We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, 
including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide 
alignment, constrained alignment and parallel processing, which were implemented after the previous major update. 
This report shows actual examples to explain how these features work, alone and in combination. Some examples 
incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and 
our ongoing efforts to overcome such limitations. 
Introduction

Multiple sequence alignment (MSA) plays an important 
role in evolutionary analyses of biological sequences. 
MAFFT is an MSA program, first released in 2002 (Katoh 
et al. 2002). Because of its high performance (Nuin et al. 
2006; Golubchik et al. 2007; Dessimoz and Gil 2010; Letsch 
et al. 2010; Sahraeian and Yoon 2011; Sievers et al. 2011), 
MAFFT has become popular in recent years. After reviewing
the previous version (version 6) in Katoh and Toh (2008b), 
we have been continuously improving its accuracy, speed, 
and utility in practical situations. These improvements and 
techniques were mostly reported in individual papers (Katoh 
et al. 2009; Katoh and Toh 2010; Katoh and Frith 2012; Katoh 
and Standley 2013). In this report, we demonstrate the 
different kinds of analyses that can be achieved with the 
new features, alone and in combination, using realistic examples.
We also discuss limitations of the current version by giving
examples of sequences incorrectly aligned by MAFFT, and 
describe our ongoing efforts to overcome these limitations. 

Basic Concepts and Usage 

As listed in table 1, MAFFT version 7 has options for various 
alignment strategies, including progressive methods 
(PartTree, FFT-NS-1, and FFT-NS-2) (Feng and Doolittle 1987;
Higgins and Sharp 1988; Katoh and Toh 2007), iterative 
refinement methods (FFT-NS-i, L-INS-i, E-INS-i, and G-INS-i)
(Barton and Sternberg 1987; Berger and Munson 1991; Gotoh 
1993; Katoh et al. 2005), and structural alignment methods for 
RNAs (Q-INS-i and X-INS-i; Katoh and Toh 2008a). See Katoh
and Toh (2008b) for details of these strategies. According to a 
recent comparative study based on the MetAl metric
(Blackburne and Whelan 2012a, 2012b), there are two significantly
different classes of MSA methods, similarity-based
methods and evolution-based methods. MAFFT is classified
as a similarity-based method. However, evolutionary informa- 
tion is useful even for similarity-based methods, because the 
sequences to be aligned are generated from a common 
ancestor in the course of evolution. In this respect, MAFFT 
takes evolutionary information into account. 

All the options of MAFFT assume that the input sequences 
are all homologous, that is, descended from a common an- 
cestor. Thus, all the letters in the input data are aligned. 
Genomic rearrangement or domain shuffling is not assumed, 
and thus the order of the letters in each sequence is always 
preserved, although the sequences can be reordered accord- 
ing to similarity. Most options in MAFFT assume that almost 
all the pairs in the input sequences can be aligned, locally or 
globally. In such a situation, there is a tradeoff between accu- 
racy and speed. For example, the PartTree option (Katoh and 
Toh 2007) is a fast and rough method, whereas L-INS-i and 
G-INS-i are slower and more accurate. RNA structural align- 
ment methods are generally more accurate and computation- 
ally more expensive because they need additional calculations 
(Katoh and Toh 2008a). However, this tradeoff does not 
always hold. In particular, the new options to add sequences 
into an existing alignment (Katoh and Frith 2012) require
careful consideration of this tradeoff, as discussed later. 

Profile Alignments 

MAFFT has a subprogram, mafft-profile, to align two
existing alignments.

mafft-profile alignment1 alignment2 > output

This method separately converts alignment1 and alignment2
to profiles and then aligns the two profiles. It means
that the two input alignments are assumed to be
phylogenetically isolated from each other, as in figure 1A.
Careless application of this method results in serious misalignments,
as discussed in a later section.

[Figure 1. Panels: (A) mafft-profile: Alignment1 (converted to a profile) and Alignment2 (converted to a profile). (B) --addprofile: Alignment1 (converted to a profile) and Alignment2 (not converted to a profile). (C) Separate application of mafft-profile (useless in any situation): new sequences and an existing alignment (converted to a profile). (D) --add, --addfragments: new sequences and an existing alignment (not converted to a profile).]

Fig. 1. Assumptions on the phylogenetic relationship in different options of MAFFT. (A) mafft-profile, (B) --addprofile, (C) misuse of mafft-profile, and (D) --add or --addfragments.
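As a rough illustration of what converting an alignment to a profile means, the following Python sketch computes per-column symbol frequencies. It is a simplified assumption for illustration only: the function name alignment_to_profile is hypothetical, all sequences are weighted equally, and the profiles actually built by mafft-profile are not described in this report.

```python
from collections import Counter


def alignment_to_profile(rows):
    """Return one {symbol: frequency} dict per alignment column.

    Assumption: equal weight for every sequence, and gaps ('-') are
    counted as an ordinary symbol. This is an illustration, not the
    profile representation used inside MAFFT.
    """
    if not rows:
        raise ValueError("empty alignment")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


if __name__ == "__main__":
    for i, col in enumerate(alignment_to_profile(["AC-", "ACG", "TCG"])):
        print(i, col)
```

Once an alignment has been reduced to such column summaries, the relationships among its individual sequences are no longer visible to the aligner, which is why the two inputs are effectively treated as two isolated groups.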

MAFFT version 7 has an alternative option,
--addprofile, which is safer against misuse.

mafft --addprofile alignment1 alignment2 > output

This option accepts two existing alignments, alignment1
and alignment2, and assumes the phylogenetic relationship
shown in figure 1B. That is, alignment1 is assumed to form
a monophyletic cluster, but alignment2 is not assumed to
form a monophyletic cluster. The cluster of alignment1
can be placed in any phylogenetic position in the tree of
alignment2. Moreover, this option checks whether
alignment1 forms a monophyletic cluster. If not, it returns
an error message and asks the user to use the --add option (see
the following section).
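To make the notion of a monophyletic cluster concrete, the sketch below tests whether a group of sequences forms a complete clade in a rooted tree. The nested-tuple tree representation and the function name is_monophyletic are hypothetical, and this report does not describe how --addprofile performs its own check, so the code should be read only as an illustration of the concept.

```python
def is_monophyletic(tree, group):
    """Return True if `group` is exactly the leaf set of some clade.

    Assumption: `tree` is a rooted tree written as nested tuples with
    string leaves, e.g. (("a", "b"), ("c", ("d", "e"))).
    """
    target = frozenset(group)
    found = False

    def visit(node):
        nonlocal found
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        if leaves == target:
            found = True
        return leaves

    visit(tree)
    return found


if __name__ == "__main__":
    tree = (("a", "b"), ("c", ("d", "e")))
    print(is_monophyletic(tree, {"d", "e"}))
    print(is_monophyletic(tree, {"b", "c"}))
```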

Adding Unaligned Sequences into an MSA 

As a result of advances in sequencing technologies, we 
increasingly need MSAs consisting of a larger number of 
sequences. There are several different approaches to enable 
construction of large MSAs, such as rapid algorithms and 
parallelization. Here, we describe an alternate approach: use 
of an existing alignment. There already exist databases of 
carefully aligned and annotated sequences (Cole et al. 2009; 
Sigrist et al. 2010; Punta et al. 2012), in which each MSA 
consists of a small number (typically up to ~1,000) of se- 
quences. We can use such MSAs as a backbone to build a 
larger MSA containing newly sequenced data. This is more 
efficient than rebuilding the entire MSA from a set of 
ungapped sequences. Moreover, this approach is relatively 
robust to low-quality sequences resulting from sequencing 
errors, misassemblies, and other factors. Such noise usually 
has a negative effect on the quality of an MSA, but there are
situations where biologically important information is contained
in low-quality sequences. In such a case, we first select
highly reliable sequences to build a backbone MSA, and then 
add the other sequences, including low-quality ones, into the 
MSA. As a result, the quality of the final MSA is less affected 
by the low-quality sequences. 

Inappropriate Applications of Profile Alignment 
The mafft-profile program is not useful for this purpose.
There are two types of misapplications. One is as follows: 
1) convert an existing alignment to a profile, 2) align new 
sequences and convert them to a profile, and 3) align the two 
profiles. This procedure is inappropriate for adding new se- 
quences because it assumes a phylogenetic relationship as 
illustrated in figure 1A. 

Another misapplication is as follows: 1) convert the 
existing alignment to a profile, 2) separately align each new 
sequence to the profile of the existing alignment, and 3) 
construct a full alignment from the individual alignments 
computed in the previous step. This approach is more 
reasonable than the first one but still problematic, because 
the phylogenetic positions of new sequences are assumed at 
the root of the tree, as illustrated in figure 1C. Results of this 
procedure for two cases are shown in table 2 and figure 2. 

The --add and --addfragments Options
To overcome this limitation of profile alignment, in 2010, we
implemented an option, --add, to add unaligned sequences
to an existing MSA. This option assumes that each new se- 
quence was derived from a branch in the tree of an existing 
alignment, as illustrated in figure 1D. This option works 
almost identically to the standard progressive method, 
except that the alignment calculation is skipped at the 
nodes whose children are all in the existing alignment. 
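The skipping rule described above can be sketched as a post-order traversal of the guide tree. The nested-tuple tree representation and the function name nodes_to_align are hypothetical; the sketch only identifies which internal nodes would need an alignment step and does not perform any alignment.

```python
def nodes_to_align(tree, existing):
    """List internal nodes (as tuples of their leaves) that need alignment.

    Assumptions: `tree` is a binary guide tree written as nested tuples
    with string leaves, and `existing` is the set of sequence names
    already present in the existing alignment. A node is skipped when
    every leaf below it is already in the existing alignment.
    """
    todo = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        leaves = visit(left) + visit(right)
        if not all(leaf in existing for leaf in leaves):
            todo.append(tuple(leaves))
        return leaves

    visit(tree)
    return todo


if __name__ == "__main__":
    tree = ((("a", "b"), "new1"), ("c", "d"))
    for node in nodes_to_align(tree, {"a", "b", "c", "d"}):
        print(node)
```

In this toy tree, the nodes joining only a with b and only c with d are skipped, because their relative alignment is already fixed by the existing MSA.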
[Figure 2. Alignment images, panels A-E. Timings shown in the figure: Step 1: L-INS-i for full-length alignment, CPU time = 15 min, wall-clock time = 1.0 min; Step 2: --reorder --6merpair --addfragments, CPU time = 1.5 min, wall-clock time = 12 s.]

Fig. 2. ITS alignments by different options of MAFFT, displayed on Jalview (Waterhouse et al. 2009). (A, B) Incorrect alignments by the FFT-NS-2 and
L-INS-i algorithms, respectively. (C) An incorrect alignment by mafft-profile. The full-length sequences were aligned with the L-INS-i algorithm
and then each new sequence was separately added to the full-length alignment, using mafft-profile. (D) Reasonable alignment by a two-step
strategy. The --6merpair --addfragments option was used at the second step. (E) Reordered version of D; sequences are ordered such that
similar sequences are placed closely. All calculations were performed using 16 cores on a Linux PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM.



With the spread of second-generation sequencers,
we sometimes need to align short reads to an existing
alignment. Several tools (Berger and Stamatakis 2011;
Loytynoja et al. 2012; Sun and Buhler 2012) for this purpose
were developed between 2011 and 2012. A limitation of the
--add option in MAFFT for this purpose was pointed out in
Loytynoja et al. (2012). Thus, we implemented a new option,
--addfragments, which does not consider the relationship
among the sequences to be added. Details of the --add
and --addfragments options are described in Katoh and
Frith (2012).

Test Case 1: Fungal Internal Transcribed Spacers 
Sequences 

Here, we discuss how the --addfragments option works,
using an actual case. Internal transcribed spacers (ITSs) are
spacer regions located between structural ribosomal RNAs.
The structure of the rDNA region in a eukaryotic genome 
is 18S - ITS1 - 5.8S - ITS2 - 28S. Here, we use a data set 
consisting of ITS1 and ITS2 sequences obtained from envi- 
ronmental samples (Chen W, personal communication). Each 
sequence contains only the ITS1 or the ITS2 region, extracted from
454 pyrosequencing data using Fungal ITSextractor (Nilsson et 
al. 2010). In addition, several fungal genomic sequences that 
fully cover ITS1 + 5.8S rRNA + ITS2 are available from public 
databases. 

Suppose a situation where we need an MSA of approxi- 
mately 300 full-length sequences and approximately 5,000 
ITS1 or ITS2 sequences. One possible solution is to build an 
entire MSA at once. The result of the default option (FFT- 
NS-2) of MAFFT is obviously incorrect, as shown in 
figure 2A. ITS1 and ITS2 regions are forced to be aligned
to each other. Even if a more computationally expensive (and
usually more accurate) method, L-INS-i, is applied (CPU time
= 98 h), the alignment is still obviously incorrect (fig. 2B).

Table 2. Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012).

Data    Method                                              Accuracy  CPU Time      Actual Time^a
Case 1  mafft --multipair --addfragments frags existingmsa  0.9969    6.67 days     18.3 h
        mafft --6merpair --addfragments frags existingmsa   0.9949    3.76 h        36.2 min
        mafft --localpair --add frags existingmsa           0.9707    23.4 days^b   2.43 days^b
        mafft --6merpair --add frags existingmsa            0.9604    1.32 h        1.44 h
        profile alignment                                   0.2779    15.5 h        1.60 h
Case 2  mafft --6merpair --addfragments frags existingmsa   0.9969    4.54 h        33.8 min
Case 3  mafft --6merpair --addfragments frags existingmsa   0.9949    1.79 days     5.91 h

Note. The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters/the number of aligned letters in the
CRW alignment). Calculations were performed on a Linux PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript b), or on a Linux
PC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases).
Case 1: 13,822 sequences in the existing alignment × 13,821 fragments;
Case 2: 1,000 sequences in the existing alignment × 138,210 fragments;
Case 3: 13,822 sequences in the existing alignment × 138,210 fragments.

a Wall-clock time with 10 cores. Command-line argument for parallel processing is --thread 10.

b Full command-line options are as follows: mafft --localpair --weighti 0 --add frags existingmsa.

Two-step strategies can solve this type of problem. That is,
a set of full-length sequences taken from databases is first
aligned to build a backbone MSA, and then the new ITS1 and
ITS2 sequences are added into this backbone MSA, using the
--addfragments option.

Step 1: mafft --auto full_length_sequences > \
        backbone_msa
Step 2: mafft --addfragments new_sequences \
        backbone_msa > output

The second command is equivalent to

mafft --multipair --addfragments \
      new_sequences backbone_msa > output

in which Dynamic Programming (DP) is used to compute the
distances between every new sequence and every sequence in
the backbone MSA (--multipair is selected by default). A faster
alternative is

mafft --6merpair --addfragments \
      new_sequences backbone_msa > output

where distances are rapidly estimated using the number of
shared 6mers, instead of DP.
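The idea of estimating a distance from shared 6mers, without any DP, can be sketched as follows. This is a deliberately simplified assumption: the function names and the normalization by the smaller 6mer set are hypothetical choices made for illustration, and the formula used inside MAFFT may differ.

```python
def kmer_set(seq, k=6):
    """Return the set of all k-mers occurring in a sequence."""
    seq = seq.lower()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=6):
    """Rough distance in [0, 1] based on the number of shared k-mers.

    Assumption: the shared count is normalized by the smaller k-mer set,
    so that a fragment fully contained in a longer sequence gets 0.0.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 1.0  # too short to compare
    shared = len(ka & kb)
    return 1.0 - shared / min(len(ka), len(kb))


if __name__ == "__main__":
    full_length = "acgtacgtaggct"
    fragment = "gtacgtag"
    print(kmer_distance(full_length, fragment))
    print(kmer_distance("aaaaaaaa", "cccccccc"))
```

Counting shared 6mers needs only set operations on each pair, whereas DP fills a matrix whose size grows with the product of the two sequence lengths.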

The result of the latter option (--6merpair
--addfragments) is shown in fig. 2D and E. The difference
between D and E is just in the order of sequences; the sequences
were reordered according to similarity using the
--reorder option in E. In this alignment, ITS1 and ITS2
are clearly separated and aligned to appropriate positions in
the full-length alignment. Moreover, this strategy is computationally
much less expensive (CPU time = 15 min [first
step] + 1.5 min [second step]) than the full application
of L-INS-i (CPU time = 98 h). The former option
(--multipair --addfragments) also returns a similar
result to the latter (--6merpair) but is slower (CPU
time = 48.6 min [second step]).
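For scripted use, the two-step strategy can be wrapped in a small helper that assembles the two command lines. The sketch below uses only options that appear in this report (--auto, --addfragments, --6merpair, --multipair, --reorder, and --thread); the function names, default file names, and the assumption that an executable called mafft is on the PATH are illustrative.

```python
import subprocess


def two_step_commands(full_length, new_sequences, backbone="backbone_msa",
                      fast=True, reorder=False, threads=None):
    """Build argument lists for the backbone step and the add step."""
    common = ["--thread", str(threads)] if threads is not None else []
    step1 = ["mafft", *common, "--auto", full_length]
    step2 = ["mafft", *common]
    if reorder:
        step2.append("--reorder")
    step2.append("--6merpair" if fast else "--multipair")
    step2 += ["--addfragments", new_sequences, backbone]
    return step1, step2


def run_to_file(argv, out_path):
    """Run a command and write its standard output to out_path."""
    with open(out_path, "w") as handle:
        subprocess.run(argv, stdout=handle, check=True)


if __name__ == "__main__":
    first, second = two_step_commands("full_length_sequences",
                                      "new_sequences", reorder=True)
    print(" ".join(first), "> backbone_msa")
    print(" ".join(second), "> output")
    # To execute (requires MAFFT to be installed):
    # run_to_file(first, "backbone_msa"); run_to_file(second, "output")
```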

This case suggests that it is crucial to select a strategy 
appropriate to the problem of interest. The most time-consuming
method, L-INS-i, is not always the most accurate
one. The difficulty of this problem for standard approaches
comes from the fact that ITS1 sequences and ITS2 sequences 
are not homologous to each other and most pairwise align- 
ments are impossible. Because of these nonhomologous pairs, 
the distance matrix used for the guide tree calculation is 
not additive; the distances between ITS1 and full-length 
sequences and those between ITS2 and full-length sequences 
are close to zero, whereas the distances between ITS1 and 
ITS2 are quite large. In this situation, it is difficult for normal 
distance-based tree-building methods to give a reasonable 
tree. Moreover, in the alignment step, the objective function 
of the L-INS-i is affected by inappropriate pairwise alignment 
scores between ITS1 and ITS2. Such problems can be avoided 
by just ignoring the relationship between ITS1 and ITS2, 
as done in the --addfragments option.
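The non-additivity can be illustrated with a small numerical check. Distances measured along a tree with nonnegative branch lengths always satisfy the triangle inequality, so any violation shows that no such tree fits the matrix. The distance values below are hypothetical numbers chosen to mimic the situation described above; they are not measurements from the ITS data set.

```python
from itertools import permutations


def triangle_violations(dist):
    """Return triples (a, b, c) with d(a, c) > d(a, b) + d(b, c).

    `dist` maps frozenset({x, y}) to the distance between x and y.
    Each violating triple is reported once (with a < c).
    """
    names = sorted({name for pair in dist for name in pair})

    def d(x, y):
        return dist[frozenset((x, y))]

    return [(a, b, c) for a, b, c in permutations(names, 3)
            if a < c and d(a, c) > d(a, b) + d(b, c) + 1e-12]


if __name__ == "__main__":
    # Hypothetical distances: fragments look close to the full-length
    # sequence, but ITS1 and ITS2 look very distant from each other.
    dist = {
        frozenset(("ITS1", "full")): 0.05,
        frozenset(("ITS2", "full")): 0.05,
        frozenset(("ITS1", "ITS2")): 1.0,
    }
    print(triangle_violations(dist))
```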

In addition, a result of the second type of misuse of
mafft-profile (discussed earlier) is shown in figure 2C.
Some new sequences are correctly aligned but others are
obviously incorrectly aligned (note that the order of sequences
in fig. 2C is identical to that in fig. 2D). These misalignments
are due to an incorrect assumption on phylogenetic
placement of new sequences shown in figure 1C.

Test Case 2: Bacterial SSU rRNA 

Another case is the 16S.B.ALL data set by Mirarab et al. (2012). 
It consists of an MSA of 13,822 bacterial SSU rRNA sequences, 
taken from the Gutell Comparative RNA Website (CRW) 
(Cannone et al. 2002) and 138,210 fragmentary sequences, 
which were originally included in the CRW alignment
but ungapped and artificially truncated. In Katoh and 
Standley (2013), we used a subset (13,821 fragmentary 
sequences) prepared by Mirarab et al. (2012). In addition to 
this subset, here we use the full data set (138,210 fragmentary 
sequences), to examine the scalability. Suppose a situation 
where we already have a manually curated (or backbone) 
MSA and a newly determined set of many fragmentary 
sequences in a metagenomics project, and we need an 
entire MSA of them. 

The first four lines in table 2 (case 1) show the performance
of various options for such an analysis, with a relatively
small data set (13,822 sequences in the existing
alignment × 13,821 fragments). The accuracy of each
resulting MSA was evaluated by comparing the MSA with
the original CRW alignment. CPU time and wall-clock time
for each method are also listed. As the sequences in this
data set are highly conserved, the difference in accuracy between
the default (--multipair --addfragments) and
the faster option (--6merpair --addfragments) is
small.
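An accuracy of the form "correctly aligned letters / aligned letters in the reference" can be illustrated for a single pair of sequences. The sketch below is an assumption about how such a score might be computed for one fragment against one backbone sequence; the function names are hypothetical, and the exact counting used for table 2 is not spelled out in this report.

```python
def column_partners(row_a, row_b, gap="-"):
    """Map each residue index of row_a to the residue index of row_b
    that shares its column (None if row_b has a gap there)."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same alignment")
    partners = {}
    ia = ib = 0
    for ca, cb in zip(row_a, row_b):
        if ca != gap:
            partners[ia] = ib if cb != gap else None
            ia += 1
        if cb != gap:
            ib += 1
    return partners


def pair_accuracy(ref_a, ref_b, test_a, test_b):
    """Fraction of reference-aligned letters of sequence a that are
    paired with the same letter of sequence b in the test alignment."""
    ref = column_partners(ref_a, ref_b)
    test = column_partners(test_a, test_b)
    aligned = [i for i, j in ref.items() if j is not None]
    if not aligned:
        raise ValueError("no aligned letters in the reference")
    correct = sum(1 for i in aligned if test.get(i) == ref[i])
    return correct / len(aligned)


if __name__ == "__main__":
    # Fragment "ACGT" against backbone sequence "ACTGT".
    print(pair_accuracy("AC-GT", "ACTGT", "ACG-T", "ACTGT"))
```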

Again, the tradeoff between accuracy and speed does not 
hold. The application of a computationally expensive method 
based on L-INS-i (--localpair --add) has no advantage,
because the extra computational time is spent on the
comparison of nonoverlapping fragmentary sequences, which 
have no reasonable solutions. 

The "profile alignment" line in table 2 shows results of the 
second type of misuse of profile alignment (discussed earlier), 
in which the given alignment is converted to a profile and 
each new sequence is separately aligned to the profile. This 
result clearly indicates that the application of profile align- 
ment must be avoided in this case, too. Users do not need to 
be too worried about this misuse, because this calculation is 
disabled in MAFFT unless the user modifies the code or writes 
a wrapper script. 

The last two lines in table 2 (cases 2 and 3) show
the performance of the fast option (--6merpair
--addfragments) for a larger number (138,210) of fragmentary
sequences. The number of sequences in the existing
alignment is 1,000 and 13,822 in cases 2 and 3, respectively.
This fast option gives a reasonable quality of result in a
reasonable computing time. At present, the default option
(--multipair --addfragments) cannot handle cases 2
and 3. Simulation-based benchmarks in Katoh and Frith
(2012) suggested that, for cases with more divergent se- 
quences, the accuracy of the default option is higher than 
that of the fast option. We are now trying to improve the 
scalability of the default option. 

Parallelization 

MAFFT version 7 has an option for parallel processing,
--thread (Katoh and Toh 2010). This feature is currently
supported on Mac OS X in addition to Linux, but not yet
supported on Windows for technical reasons. With the
--thread n option, it runs in parallel with n threads. The
number of threads can be automatically determined by
--thread -1. This option sets the number of threads as
the number of physical cores, not the number of logical cores
in Intel's hyperthreaded CPUs.

For progressive methods, the result with the multithread 
version is identical to that of the serial processing version. 
However, for iterative refinement methods, the results are 
not always identical. We confirmed that the accuracy of the 
parallel version in this case is comparable with that of the 
serial version (Katoh and Toh 2010). The efficiency of paral- 
lelization depends on the alignment strategy. In the case 
of the --addfragments option, the efficiency is acceptably
high as shown in table 2. 
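One rough way to read the efficiency from table 2 is to divide the CPU time by the wall-clock time obtained with 10 threads. The sketch below does this for the two --addfragments rows of case 1. Treating this ratio as a speedup is an assumption: CPU time summed over threads may include parallelization overhead, so the ratio is only a proxy, and the helper names are hypothetical.

```python
HOURS_PER_UNIT = {"s": 1 / 3600, "min": 1 / 60, "h": 1.0, "days": 24.0}
THREADS = 10  # table 2, footnote a: --thread 10


def to_hours(value, unit):
    """Convert a (value, unit) timing to hours."""
    return value * HOURS_PER_UNIT[unit]


def cpu_to_wall_ratio(cpu, wall):
    """Ratio of CPU time to wall-clock time, both as (value, unit)."""
    return to_hours(*cpu) / to_hours(*wall)


# Timings copied from table 2, case 1.
ROWS = {
    "--multipair --addfragments": ((6.67, "days"), (18.3, "h")),
    "--6merpair --addfragments": ((3.76, "h"), (36.2, "min")),
}

if __name__ == "__main__":
    for name, (cpu, wall) in ROWS.items():
        ratio = cpu_to_wall_ratio(cpu, wall)
        print(f"{name}: ratio {ratio:.2f}, per thread {ratio / THREADS:.2f}")
```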



Utility Options 

MAFFT version 7 also has several enhanced options for 
peripheral functions. 

Estimating the Direction of DNA Sequences 
In the case of nucleotide alignments, if some of the input
sequences have an incorrect direction relative to the other
sequences, the directions can be automatically adjusted by
the --adjustdirection option. We use an algorithm
with a time complexity of O(n^2), where n is the number of
sequences (Katoh and Standley 2013). It is slow when the
distances are calculated with DP. However, when the distance 
is rapidly calculated based on the number of shared 6mers, 
the speed is reasonable. This option is also available on the 
web version, with the "Adjust direction" button. 
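The basic idea of choosing a direction from shared 6mers can be sketched for a single query against a single reference. This is a simplification made for illustration: the algorithm behind --adjustdirection compares sequences with one another at O(n^2) cost, whereas the hypothetical function adjust_direction below only compares each query with one fixed reference.

```python
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq, k=6):
    seq = seq.lower()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def adjust_direction(reference, query, k=6):
    """Return the query, reverse-complemented if that orientation
    shares more k-mers with the reference (ties keep the input)."""
    ref_kmers = kmer_set(reference, k)
    flipped = reverse_complement(query)
    forward = len(ref_kmers & kmer_set(query, k))
    reverse = len(ref_kmers & kmer_set(flipped, k))
    return flipped if reverse > forward else query


if __name__ == "__main__":
    reference = "aaaaccccggaatt"
    query = reverse_complement(reference)
    print(query, "->", adjust_direction(reference, query))
```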

MAFFT cannot handle more complicated sequences with 
genomic rearrangements (translocations, duplications, or 
inversions). The web version of MAFFT displays dot plots 
between the first sequence and the remaining sequences, 
using the LAST local alignment program (Kielbasa et al. 
2011), for every nucleotide alignment run. By viewing the 
dot plots, a user can easily check for genomic rearrangements 
and the directions of input sequences. See Katoh and 
Standley (2013) for details and an example. 

Input/Output 

MAFFT version 7 has several enhancements in the flexibility 
of input/output. The following options related to input/ 
output are available and can be combined with other options. 

--anysymbol If the input data include unusual letters,
like U, J, etc. (in the case of protein data), MAFFT
stops by default. The --anysymbol option allows these
letters and nonalphabetical characters.

--preservecase By default, amino acid sequences
are converted to upper case and nucleotide sequences
are converted to lower case. This behavior can be changed
by using the --preservecase option.

--reorder The order of sequences is the same as the
input sequences by default, but the sequences can be
sorted according to similarity to each other by the
--reorder option.

--phylipout and --clustalout The output
format is multi-fasta by default, but the phylip (interleaved)
format and the clustal format can be selected.

Guide Tree and Phylogenetic Positions of New 
Sequences 

Users can check the guide tree by using the --treeout
option. In the case of --addfragments, the estimated
phylogenetic positions of new sequences are shown together
with the estimated tree of the existing alignment. The alignment
calculation is performed based on this phylogenetic
estimation. It is also possible to compute such phylogenetic
information only, without alignment, by the --retree 0
option. An example of output is shown in figure 3A.
[Figure 3A: excerpt of --treeout output]

104: new104
   nearest sequence: 20
   approximate distance: 0.273000
   sister group: 10 77 21 97 20 83
   approximate distance: 0.275293
120: new120
   nearest sequence: 10
   approximate distance: 0.000000
   sister group: 10
   approximate distance: 0.000000
182: new182
   nearest sequence: 20
   approximate distance: 0.000000
   sister group: 20
   approximate distance: 0.000000
310: new310
   nearest sequence: 20
   approximate distance: 0.022000
   sister group: 20
   approximate distance: 0.022000
588: new588
   nearest sequence: 10
   approximate distance: 0.057000
   sister group: 10 77 21 97
   approximate distance: 0.057000
593: new593
   nearest sequence: 10
   approximate distance: 0.160000
   sister group: 10 77 21 97 20
   approximate distance: 0.163300

[Figure 3B: graphical representation of (A)]

Fig. 3. (A) A part of output of the --treeout option showing the phylogenetic positions of new sequences (new#) in the tree of the existing
alignment (backbone#), estimated before the alignment calculation. This file also shows a Newick format tree of the existing alignment (not
shown in this figure). For each new sequence, the nearest sequence in the existing alignment (nearest sequence), approximate distance to
the nearest sequence (approximate distance), and the members of the sister group (sister group) are shown. (B) Graphical representation
of (A).
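Records of the form shown in figure 3A are easy to post-process. The following sketch parses such records into dictionaries. It assumes the layout of the excerpt above (a header line "index: name" followed by labeled lines); the layout of a real --treeout file may differ in detail, and the function name parse_placements and the dictionary keys are hypothetical.

```python
import re

HEADER = re.compile(r"^(\d+):\s*(\S+)$")


def parse_placements(text):
    """Parse placement records into a list of dictionaries.

    Each "approximate distance" line is attached to the most recent
    "nearest sequence" or "sister group" line of the same record.
    """
    records = []
    current = None
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            current = {"index": int(header.group(1)), "name": header.group(2)}
            records.append(current)
            last_key = None
            continue
        if current is None:
            continue  # ignore anything before the first record
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "nearest sequence":
            current["nearest"] = int(value)
            last_key = "nearest"
        elif key == "sister group":
            current["sister_group"] = [int(v) for v in value.split()]
            last_key = "sister_group"
        elif key == "approximate distance" and last_key:
            current[last_key + "_distance"] = float(value)
    return records
```

For example, new sequences whose nearest-sequence distance is 0.0 (new120 and new182 in the excerpt) could be filtered out as apparently identical, at this rough level of estimation, to a sequence already in the existing alignment.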
Note that this phylogenetic information is roughly estimated
before the MSA calculation, not based on the MSA.
In particular, with the fast option, --6merpair, the estimation
is very rough. With the --multipair option (default), the
estimation is expected to be better, but it needs a relatively
long computational time. For more rigorous estimation of
phylogenetic positions of new sequences, specially designed
tools, such as pplacer (Matsen et al. 2010), PaPaRa (Berger and
Stamatakis 2011), PAGAN (Loytynoja et al. 2012), SEPP
(Mirarab et al. 2012), or combinations of them including
MAFFT, should be tried.

Parameters 

For amino acid alignment, MAFFT uses the BLOSUM62 
matrix by default. For nucleotide alignment, a 200PAM 
log-odds scoring matrix is generated assuming that the tran- 
sition rate is twice the transversion rate. These matrices are 
suitable for aligning distantly related sequences. We selected 
these default parameters based on an expectation that, if the 
program works well for difficult (distantly related) cases, it 
should also work well for easy cases. 

It is unclear whether this expectation is always correct. For 
example, in a benchmark using simulated protein sequences 
(Loytynoja et al. 2012) generated by INDELible (Fletcher and
Yang 2009), when we tested a more stringent scoring matrix, 
JTT 1PAM (Jones et al. 1992) with weaker gap penalties than 
the default, the benchmark scores were considerably im- 
proved. Despite this observation, we consistently used the 
default parameters in the benchmark in Katoh and Frith 
(2012), because it does not make sense to arbitrarily adjust
parameters to a simulation setting. This observation suggests
that the current default parameters of MAFFT might not 
be very suitable for aligning closely related sequences. 
However, this idea must be checked using actual biological 
sequences. 

Users can select scoring matrices other than the
default. For amino acid alignment, --bl 45, --bl 62,
--bl 80, --jtt N, and --tm N are accepted, where N is an
expected evolutionary distance among input sequences. The
--bl, --jtt, and --tm options mean BLOSUM
(Henikoff S and Henikoff JG 1992), JTT (Jones et al. 1992),
and a transmembrane model (Jones et al. 1994), respectively.
A user-defined scoring matrix can also be accepted, by
--aamatrix. For nucleotide alignments, --kimura N is
accepted, where N is an expected evolutionary distance
among input sequences. Gap penalties can be adjusted by
the --op, --exp, --lop, and --lexp options.

One possible extension is to use different scoring matrices 
and gap penalties for different sequence pairs according to 
the divergence level, like ClustalW (Thompson et al. 1994). 
More studies using actual sequence data will be necessary 
before implementing this extension. It will also be necessary 
to adjust gap penalties, preferably based on a realistic evolu- 
tionary model of insertions and deletions. 

Use of Structural Information 

We have discussed possible improvements in MSAs of closely 
related sequences in the previous section. MSA of distantly 
related sequences is still a challenging problem. 
Fig. 4. (A) Superposition of 3v33, 2qip, and 1taq structures visualized by PyMOL (Schrodinger LLC 2010). (B) MAFFT-L-INS-i sequence alignment
displayed on Jalview (Waterhouse et al. 2009). Misaligned Ds are highlighted in red. (C) Structure-informed MSA with correctly aligned Ds. Alpha helices
and beta sheets are shown in blue and yellow, respectively, in (A-C).



Test Case 3: PIN Domain 

Figure 4 shows a typical limitation of sequence-level alignment
for a highly divergent set of three PIN domain-containing
proteins: human regnase-1, VPA0982 from Vibrio parahaemolyticus,
and the nuclease domain of Taq polymerase from Thermus
aquaticus. These three proteins share a magnesium-binding
site composed of three conserved aspartic acids. Figure 4A
shows a superposition of the three structures (Protein
Data Bank identifiers 3v33, 2qip, and 1taq, respectively). The
middle aspartic acid is indicated by sphere representation,
colored red. In Figure 4B, a typical MSA (by MAFFT-L-INS-i)
is shown wherein the middle aspartic acid position is misaligned.
In Figure 4C, a structure-informed MSA (described
below), with the middle aspartic acid correctly aligned, is
shown.

Strategy for Integrating Structural Alignments and 
MAFFT 

It has long been known that structural information can be 
used to improve MSA calculations. This was the basis of 
the 3D Coffee program (O'Sullivan et al. 2004), and later 
the PROMALS3D package (Pei et al. 2008). Here, we address 
incorporation of protein structural information in MAFFT- 
based MSA construction. There are both conceptual issues 
and technical issues that complicate the process. 
Conceptually, we have to define structural similarity in such 
a way that it can easily be used in sequence alignments. 
We discuss our approach to this problem below in the con- 
text of integrating MAFFT with the structural alignment pro- 
gram ASH (Standley et al. 2004, 2007). On the technical level, 
structural information complicates matters simply because 
protein structures contain more information and more 
noise than sequence information. 

Here, we focus on one essential feature of ASH: the equivalence
score that is used to define structural similarity. A
particular element in the structural similarity matrix takes
the form of a Gaussian-shaped function of the inter-residue
distance

$$e_{ij} = \exp\left(-(d_{ij}/d_0)^2\right),$$

where $d_{ij}$ is the distance between two alpha carbons $i$ and $j$ in
the two input structures and $d_0$ is a parameter that defines
tolerance in the score. The default behavior is to set $d_0$ to 4 Å.
The goal of ASH is to maximize the sum of $e_{ij}$ over aligned
residues. The residue-level equivalences, which form the basis
of all ASH alignments, provide a convenient route for combining
MAFFT and ASH. We can, for example, set a threshold
value of $e_{ij}$ and incorporate highly confident parts of the
alignment into MAFFT to "seed" the MSA calculation. If we
consider the case of the three PIN domain-containing structures
in Figure 4, we can first compute structural alignments
for the three unique pairs using ASH (ash_3v33A-2qipA,
ash_3v33A-1taqA, and ash_2qipA-1taqA). If we set a threshold
for residue equivalence at >0.5, we can then combine the
equivalence-filtered alignments into MAFFT using the seed
option (Katoh et al. 2009):

mafft-linsi --seed ash_3v33A-2qipA \
            --seed ash_3v33A-1taqA \
            --seed ash_2qipA-1taqA \
            sequences > output
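To give a feel for the equivalence threshold used to prepare the seed alignments, the sketch below evaluates the Gaussian-shaped score for a given alpha-carbon distance and solves for the distance at which the score drops to the threshold. The function names and the example residue pairs are hypothetical; only the formula, the default tolerance of 4 Å, and the threshold of 0.5 are taken from the text.

```python
import math


def equivalence(d, d0=4.0):
    """Equivalence score exp(-(d / d0)^2) for a distance d in angstroms."""
    return math.exp(-(d / d0) ** 2)


def distance_cutoff(threshold, d0=4.0):
    """Distance at which the equivalence score equals `threshold`."""
    return d0 * math.sqrt(-math.log(threshold))


def confident_pairs(pairs, threshold=0.5, d0=4.0):
    """Keep aligned residue pairs (i, j, distance) scoring above threshold."""
    return [(i, j) for i, j, d in pairs if equivalence(d, d0) > threshold]


if __name__ == "__main__":
    print(f"cutoff for e > 0.5: {distance_cutoff(0.5):.2f} angstroms")
    # Hypothetical aligned pairs: (residue in A, residue in B, distance)
    pairs = [(1, 1, 1.0), (2, 2, 3.5), (3, 3, 3.0)]
    print(confident_pairs(pairs))
```

With the default tolerance, a threshold of 0.5 corresponds to keeping residue pairs whose alpha carbons lie within roughly 3.3 Å of each other after superposition.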

Because the sequence identities between the aligned struc- 
tures are low, we see an improvement in the resulting MSA 
relative to conventional MAFFT (Fig. 4). Based on this 
approach, we are developing an integrative service for protein 
structure-informed MSA construction. 